Welcome to The Great Handover. In CPU programming, we define how to iterate; in GPGPU, we define what an iteration looks like. This shift from instruction-centric to data-centric logic is powered by the Kernel Abstraction.
1. The __global__ Blueprint
By using the __global__ qualifier, you are not writing an ordinary function; you are designing a scalable blueprint. Each thread executes its own instance of the kernel as one standalone unit of work, allowing the GPU to orchestrate thousands of identical tasks across its massive core count without manual thread management.
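A minimal sketch of such a blueprint (the kernel name `hello` is illustrative, not from the original text). The __global__ qualifier marks code that is launched from the host but executed on the device by every thread in the grid:

```cuda
#include <cstdio>

// __global__ marks a kernel: launched once from the host,
// but executed by every thread the launch configuration spawns.
__global__ void hello() {
    // Each thread runs this same body with its own coordinates.
    printf("hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}
```

Launching this with many threads prints one line per thread, with no loop anywhere in the code: the "iteration" is the grid itself.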
2. The Global Address Resolver
How does a single thread among millions find its target? It uses a deterministic contract known as the indexing formula:
$$\text{threadID} = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}$$
This formula acts as a coordinate system, bridging the software's logical data (the array) to the hardware's physical hierarchy (blocks and threads).
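In kernel code, the formula typically appears as the first line of the body. A sketch assuming a hypothetical element-wise kernel `scale` over an array of `n` floats:

```cuda
// Each thread derives its unique global index from its block and
// thread coordinates, then touches exactly one array element.
__global__ void scale(float *data, float factor, int n) {
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n)          // guard: the grid may be larger than the array
        data[threadID] *= factor;
}
```

The bounds check matters because the grid size is rounded up to a whole number of blocks, so some threads at the tail may map past the end of the data.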
3. Execution Configuration
The <<<B, T>>> launch parameters define the grid shape: B blocks of T threads each. Because the runtime schedules blocks onto whatever streaming multiprocessors are available, this ensures Transparent Scalability: your code runs identical logic whether the hardware has 2 SMs or 80 SMs.
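A host-side sketch of choosing the configuration, assuming a hypothetical element-wise kernel `scale(float*, float, int)` and an already-allocated device pointer `d_data` (both illustrative names):

```cuda
int n = 1 << 20;               // example problem size (assumption)
int T = 256;                   // threads per block, a common choice
int B = (n + T - 1) / T;       // ceiling division: enough blocks to cover n
scale<<<B, T>>>(d_data, 2.0f, n);  // launch B blocks of T threads
```

The ceiling division guarantees B * T >= n, which is exactly why the kernel's bounds check against n is needed.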